I am analyzing the Red Wine Quality dataset provided by Udacity. The purpose is to determine whether any of the physicochemical properties distinguish between excellent, good and poor quality wines.
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
The red wine variant is of the Portuguese “Vinho Verde” wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.)
The dataset was created using red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
Before starting my project, I took some time learning about basic wine characteristics. I learnt that the main fundamental traits of wine are sweetness, acidity, tannin, alcohol and body. In addition, I looked through the given red wine dataset to get a feel for the input and output variables.
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Input variables (based on physicochemical tests):
Output variable (based on sensory data):
I’d like to explore which input variables had an impact on the wine quality (output variable) ratings.
So let’s begin with the data exploration.
Here is the structure of the red wine dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
There are 1599 observations and 13 variables. ‘quality’ is an output variable. ‘X’ is an observation identifier. ‘quality’ and ‘X’ are integers. The rest of the variables are numeric.
Below are the summary statistics (minimum, 1st quartile, median, mean, 3rd quartile and maximum) for all the variables.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
My main feature of interest is the quality variable. So, let’s look at the summary of the quality variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## 3 4 5 6 7 8
## 10 53 681 638 199 18
From the summary above, the wine quality ranges from 3 to 8, with a median of 6, even though the rating scale runs from 0 (very bad) to 10 (very excellent). Let’s look at the quality variable on a bar chart.
The quality variable has a roughly bell-shaped distribution over discrete integer values. The majority of wines have a quality of 5 or 6. There are very few exceptionally excellent or poor quality wines. The minimum rating is 3 and the maximum rating is 8. It is clear that there are many more good wines than excellent or poor wines. In addition, not a single wine received a score of 0, 1, 2, 9 or 10.
## Poor Good Excellent
## 63 1319 217
There are 63 poor quality wines, 1319 good quality wines and 217 excellent quality wines.
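The grouping above can be sketched in R as follows. This is a minimal sketch rather than the original code: the `quality` vector is rebuilt from the frequency table shown earlier, and the break points passed to `cut()` are an assumption chosen so that scores 3 to 4 map to “Poor”, 5 to 6 to “Good” and 7 to 8 to “Excellent”, which reproduces the reported counts.

```r
# Rebuild the quality vector from the frequency table reported above.
quality_levels <- 3:8
quality_counts <- c(10, 53, 681, 638, 199, 18)
quality <- rep(quality_levels, times = quality_counts)

# Assumed break points: (0, 4] -> Poor, (4, 6] -> Good, (6, 10] -> Excellent.
ratings <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("Poor", "Good", "Excellent"),
  ordered_result = TRUE
)

# Count the wines in each rating group.
rating_counts <- table(ratings)
print(rating_counts)
```

Using an ordered factor keeps the three groups in a meaningful order in later tables and plot legends.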
The “Good” rating has far more wines than the “Poor” and “Excellent” ratings. I am surprised that none of the wines had a quality level higher than 8 or lower than 3. There were very few wines in the ‘Poor’ and ‘Excellent’ ratings.
Now that we have seen the nature of the red wine quality, let’s explore the physicochemical input variables.
For the single variable analysis, I am going to plot a series of histograms. These histograms will show the distribution of each of the input variables.
From the input variable histograms above, we can see some interesting distributions. Let’s look at some of these histograms.
I am particularly interested in the following input variables:
Acidity is a fundamental property of wine, imparting sourness or tartness and resistance to microbial infection. Acidity of a wine helps to determine how finished wine will taste, how it feels in the mouth and how well it will age. A low acidity wine will taste flat and boring while too much acid can lead to tartness or a sour wine.
The predominant fixed acids found in wines are tartaric, malic, citric and succinic. All these acids originate in grapes with the exception of succinic acid, which is produced during the fermentation process. Grapes are one of the rare fruits that contain tartaric acid. Tartaric acid is one of the strongest acids in wine and controls the acidity of wine. It plays a critical role in the taste, feel and color of a wine. But even more important, it lowers the pH enough to kill undesirable bacteria, acting as a preservative.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity is slightly skewed to the right, with a long tail extending out to a maximum of 15.90 g/dm^3. The median value is 7.90 g/dm^3 and the mean is 8.32 g/dm^3. The skew arises because a few wines have a very high fixed acidity.
Volatile acids are produced through microbial action such as yeast fermentation, malolactic fermentation and other fermentations carried out by spoilage organisms. The most prominent volatile acid in wine is acetic acid. Acetic acid bacteria require oxygen to grow; therefore, eliminating air from wine containers and adding sulfur dioxide will limit their growth. Our palates are quite sensitive to the presence of volatile acids and for that reason winemakers try to keep their concentrations as low as possible.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity has a bimodal distribution with peaks around 0.4 and 0.6 g/dm^3. The median volatile acidity is 0.52 g/dm^3, and the mean is 0.5278 g/dm^3.
Total acidity of a wine is the combined sum of fixed and volatile acids present. So what does total acidity tell us? When we have a glass of wine our mouth is largely unable to tell the difference between fixed and volatile acids. If there is an overwhelming quantity of any single acid, say citric, we may be able to pick out its contribution to the wine. In the case of citric acid, the wine may have citrus overtones. An overabundance of a volatile spoilage acid can produce obvious flavors and aromas, but even so it must be out of balance for us to notice that one particular acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.120 7.680 8.445 8.847 9.740 16.285
The total acidity distribution looks similar to the fixed acidity distribution. This is not surprising because the volatile acidity values are much smaller than the fixed acidity values. The median total acidity is 8.445 g/dm^3, which is 0.545 g/dm^3 above the median fixed acidity.
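As a small illustration of how the total acidity variable is derived, the sketch below sums the two acidity columns for the first four wines shown in the structure output. The data frame name `wine_head` is hypothetical and only holds those four rows.

```r
# First four wines from the structure output (values in g/dm^3).
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28)
)

# Total acidity is the row-wise sum of fixed and volatile acidity.
wine_head$total.acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity
print(wine_head)
```

Because volatile acidity is below 1 g/dm^3 for these wines, the total stays close to the fixed acidity value in every row.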
Most, if not all, of the citric acid naturally present in the grapes is consumed by bacteria during fermentation. The absence of citric acid would bring the fermentation process to a grinding halt, although this almost never happens.
Citric acid plays a major role in a winemaker’s influence on acidity. Many winemakers use citric acid to acidify wines that are too basic and as a flavor additive. This process has its benefits and drawbacks. Adding citric acid gives the wine a “freshness” otherwise not present and effectively makes the wine more acidic. The major drawback is that bacteria use citric acid in their metabolism, so the added citric acid may simply be consumed by bacteria, promoting the growth of unwanted microbes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
The most common value was 0.00 (132 wines). The next most common value was 0.49 (68 wines). Some of the higher values do not occur in the data at all.
Residual sugar is the sugar that remains in a wine after fermentation completes. Often the very first impression of a wine is in its level of sweetness. The greater the amount of residual sugar, the sweeter the wine. Residual sugar is balanced by acidity, alcohol and tannins in wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
A log10 scale was applied to the residual sugar to get a better visualization of the distribution. Residual sugar had a median of 2.20 g/dm^3 with a long tail that extended out to 15.50 g/dm^3.
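To see why the log10 scale helps, the sketch below applies `log10()` to the five-number summary of residual sugar reported above and compares how far the upper tail reaches relative to the lower tail. The helper `tail_ratio()` is a hypothetical name introduced only for this illustration; it is not part of the original analysis.

```r
# Five-number summary of residual sugar from the output above (g/dm^3).
residual_sugar <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Length of the upper tail divided by the length of the lower tail,
# both measured from the median.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

raw_ratio <- tail_ratio(residual_sugar)         # on the original scale
log_ratio <- tail_ratio(log10(residual_sugar))  # on the log10 scale

print(c(raw = raw_ratio, log10 = log_ratio))
```

On the original scale the upper tail is roughly ten times longer than the lower tail; on the log10 scale it is only about twice as long, which is why the transformed histogram is easier to read.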
Chlorides is the amount of salt in wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
I transformed the long tail distribution with a log10 scale so it could be better visualized. After the transformation, the chlorides histogram appears normal, with some outliers on the right side and left side of the curve. Chlorides had a mean of 0.087 g/dm^3 and a median of 0.079 g/dm^3.
Free sulfur dioxide prevents microbial growth and the oxidation of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
There are more wines in the dataset with low levels of free sulfur dioxide than those with more. On average wines contain 15.87 mg/dm^3 of free sulfur dioxide.
Total sulfur dioxide is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
This is the amount of free and bound forms of sulfur dioxide. Similar to free sulfur dioxide, the distribution of total sulfur dioxide is positively skewed, with a few wines having extreme values. There are two large outliers in this dataset. The mean and median for total sulfur dioxide are 46.47 mg/dm^3 and 38.00 mg/dm^3 respectively.
The density of wine is close to that of water, depending on the percent alcohol and sugar content.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density is one of the few normally distributed variables in this dataset. The median and mean are nearly identical (about 0.997 g/cm^3).
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH is approximately normally distributed, with the median (3.31) and the mean (3.311) essentially equal.
Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is positively skewed with a few outliers. The mean amount of sulphates is 0.66 g/dm^3 and the median is 0.62 g/dm^3. I applied a log10 transformation to get a better visualization of the sulphate distribution.
Alcohol is the product of fermentation of the natural grape sugars by yeasts and without it wine simply doesn’t exist. The amount of sugar in the grapes determines what the final alcohol level will be. The conversion of sugar to alcohol is such a vital step in the process of making wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The average alcohol content in this dataset is about 10.42%. The distribution is right-skewed: half of the wines fall between 8.4% and 10.2%, while the other half stretch from 10.2% up to 14.9%.
This tidy data set contains 1,599 red wines with 13 variables. Eleven input variables describe the chemical properties of the wine. One additional variable, “X”, is an identifier, and there is one output variable, “quality”. All variables are numeric except “X” and “quality”, which are integers.
At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Other observations include:
The main feature of interest in the data set is quality. I’d like to know which input features are best for determining the wine quality.
Fixed acidity, volatile acidity, citric acid, residual sugar and alcohol likely contribute to the quality of wine. After doing extensive research, I think acidity and alcohol probably contribute most to the wine quality. This is based on the univariate analysis only and may change in the bivariate and multivariate analysis.
Yes, I created a couple of variables. First, a new variable called ‘ratings’ was created to reflect the quality ranges based on the quality variable. The ranges are in an ordered factor with levels “Poor”, “Good” and “Excellent”. Next, I created a variable called ‘total.acidity’ by summing fixed and volatile acidities.
The data was already in a tidy format, so no additional formatting was needed. I did remove the “X” variable from the data set, since it only represented the row number and was not required for the analysis. Volatile acidity had a bimodal distribution with peaks around 0.40 and 0.60 g/dm^3. The citric acid distribution was unusual in that the most common value was 0.00. Some of the distributions were skewed to the right, so a log10 transformation was applied to make them easier to interpret.
Let’s investigate the correlation between the variables and get a feel for potential relationships in the data.
I decided to use a correlation diagram to see the correlation coefficients between the input variables and the output variable. The main purpose is to give a high level snapshot of the relationships between these variables.
Here are the results of correlation coefficient between the quality variable and the input variables.
| Input Variables | Pearson Correlation |
|---|---|
| fixed.acidity | 0.12 |
| volatile.acidity | -0.39 |
| citric.acid | 0.23 |
| residual.sugar | 0.01 |
| chlorides | -0.13 |
| free.sulfur.dioxide | -0.05 |
| total.sulfur.dioxide | -0.19 |
| density | -0.17 |
| pH | -0.06 |
| sulphates | 0.25 |
| alcohol | 0.48 |
| total.acidity | 0.09 |
None of the input variables correlates strongly with quality, which surprised me. The strongest correlation was with alcohol (0.48), which is moderate. There were weaker positive correlations with sulphates (0.25), citric acid (0.23) and fixed acidity (0.12), while residual sugar (0.01) shows essentially no linear relationship. I was also surprised that sulphates had an influence on quality. Volatile acidity (-0.39) had the strongest negative correlation with quality, also moderate.
Let’s look at the boxplots to see the relation between quality and the physicochemical variables. The following graphs are boxplots of each input variable against quality level [3-8].
The mean increases from quality level 4 to 8. Fixed acidity has almost no effect on quality. The mean and median values of fixed acidity remain almost unchanged as quality increases. Fixed acidity has a weak positive correlation with quality.
The mean decreases from quality level 3 to 7, and increases a little at 8. Volatile acidity seems to have a negative relationship with quality. There is a definite trend toward lower volatile acidity levels as wine quality increases. We know from the background information that high levels of volatile acidity can cause the wine to taste like vinegar, so this inverse relationship between volatile acidity and quality makes sense.
The combined sum of fixed and volatile acids gives the total acidity of a wine, an essential trait for quality. The mean increases from quality level 4 to 8, in a similar pattern to fixed acidity. Total acidity has almost no effect on quality. The mean and median values of total acidity remain almost unchanged as quality increases. Total acidity has a weak positive correlation with quality. This is quite surprising.
The mean remains the same from 3 to 4 but then starts to increase from 5 to 8. Citric acid seems to have a positive correlation with quality. The higher the citric acid, the better the wine quality.
Residual sugar is the amount of sugar remaining after fermentation stops. The mean values for residual sugar are almost the same for every quality level. Residual sugar has almost no correlation with quality. We can conclude that residual sugar, and therefore sweetness, is about the same across all quality levels.
The mean decreases significantly from quality level 3 to 4, then slowly decreases all the way to 8. Even though it is a weak negative correlation, the box plots show that the lower the amount of chlorides, the better the quality of wine.
The mean increases from 3 to 5 and then gradually decreases from 5 to 8. Lower concentrations of free sulfur dioxide seem to be more prevalent in poor and excellent wines, and higher concentrations are found in good quality wines. Excellent quality wines seem to have much lower free sulfur dioxide. Free sulfur dioxide has a very weak negative correlation with quality.
Total sulfur dioxide is the amount of free and bound forms of SO2. The mean increases from 3 to 5 and then decreases from 5 to 8, the same pattern as free sulfur dioxide. Low concentrations of total sulfur dioxide seem to be prevalent in poor and excellent wines, and higher concentrations are found in good wines. Total sulfur dioxide has a very weak negative correlation with quality.
The mean decreases from quality level 3 to 4 and then increases from 4 to 5. However, it decreases from 5 to 8. It seems that lower densities are associated with better wines.
The mean remains the same between quality levels 3 and 4. It decreases from 4 to 5, increases slightly from 5 to 6, and then slowly decreases from 6 to 8. Better wines seem to have a lower pH. The lower the pH, the more intense the acids present in the wine.
The mean steadily increases from 3 to 8. Better quality wines have a stronger concentration of sulphates.
The mean increases from 3 to 4 and then decreases from 4 to 5. However, it increases significantly from 5 to 8. It is very clear that better quality wines have a higher alcohol content.
Comparing the relationship between pH and total acidity.
From the plot above there is a negative linear relationship between total acidity and pH.
Comparing the relationship between pH and citric acid.
From the plot above there is a negative linear relationship between citric acid and pH. Total acidity and citric acid both have strong correlations with pH. The lower the pH, the more intense the acidity in wines.
Residual sugar is the amount of sugar remaining after fermentation stops, that is, sugar that is not converted into alcohol during fermentation. It is clear from the scatter plot above that most of the wines contain some residual sugar regardless of their alcohol content. In dry wine, yeasts consume almost all of the sugar from the grapes. In sweet wine, the yeasts are killed before all the sugar is used, leaving behind residual sugars. However, even wines that taste very dry will have some degree of residual sugar.
Density and total acidity have a positive linear relationship. If the wine has fixed acids that don’t evaporate readily then the wine is more dense.
The relationship between alcohol and density is negative. Alcohol is lighter than water, so density decreases as alcohol increases. Fermentation is a natural process that transforms grape juice (the must) into wine. During fermentation, the sugar in the juice is converted into alcohol, and the density of the must progressively diminishes until it reaches a value from 0.990 to 0.995. Values greater than 1 indicate the presence of sugar. Dissolved sugar raises the density of the must, whereas alcohol lowers it, so the more sugar the yeast consumes, the more alcohol we get and the lower the density. The density of wine is primarily determined by the concentration of alcohol, sugar, glycerol, and other dissolved solids. Sweeter wines generally have higher densities.
Based on my research about wine, acidity, residual sugar, alcohol, tannin and body play a crucial role in wine quality. Hence, in the univariate analysis, I mainly focused on fixed acidity, volatile acidity, total acidity, citric acid, residual sugar and alcohol. However, in the bivariate analysis, I decided to analyze all the input variables against quality. There were some interesting discoveries. They are as follows:
Yes, I did observe some interesting relationships between the other input variables. Here are some of my observations:
Total acidity and pH had a strong correlation of about -0.67. The higher the total acidity, the lower the pH. This result was not surprising at all. It is consistent with malolactic fermentation, in which the tart-tasting malic acid is converted to softer tasting lactic acid. This process increases the pH and lowers the acidity.
A comparison was done between residual sugar and alcohol. They have a weak positive correlation. During fermentation, yeasts metabolize sugars for energy, yielding alcohol as a major byproduct. In dry wine, yeasts consume almost all of the sugar from the grapes. In sweet wine, the yeasts are killed before all the sugar is used, leaving behind residual sugars. Regardless of the alcohol content, most wines will have some residual sugar.
Density and total acidity had a strong correlation of about 0.68. If the wine has fixed acids that don’t evaporate readily then the wine is more dense. The acids present in the wine have densities greater than water.
Density and alcohol had an inverse relationship. Alcohol is lighter than water, so density decreases as alcohol increases. During fermentation, the density of the must progressively diminishes as alcohol is generated.
Features with the strongest relationships to wine quality are as follows:
Some of the correlations between other features showed strong relationships. They were between total acidity and density (0.68), followed by pH and fixed acidity (-0.68), pH and total acidity (-0.67), fixed acidity and citric acid (0.67), density and fixed acidity (0.67) and free sulfur dioxide and total sulfur dioxide (0.67).
Let’s do some multivariate plots of the following combinations of input variables with ratings quality:
We will use the ratings variable (Poor, Good & Excellent) created in the beginning of this report.
Excellent wines (red) tend to have low volatile acidity and high citric acid.
Excellent wines (red) tend to have low volatile acidity and high sulphates.
In the plot above, excellent wines (red) have low volatile acidity (y-axis) and high alcohol content (x-axis).
Excellent wines (red) tend to have high citric acid and high sulphates.
When citric acid and alcohol content are both high, the wines tend to be of excellent quality (red).
High sulphates and high alcohol by volume go together with excellent wines (red).
As density decreases, alcohol increases, and the wines tend to be of excellent quality (red).
As fixed acidity and citric acid increase, the quality of wine rises. The high quality wines are shown in red.
For this linear regression model, I used variables that had some good correlations with quality. They were alcohol, volatile acidity, sulphates, chlorides, pH, citric acid and density.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = wine)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates),
## data = wine)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## log10(chlorides), data = wine)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## log10(chlorides) + pH, data = wine)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## log10(chlorides) + pH + citric.acid, data = wine)
## m7: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## log10(chlorides) + pH + citric.acid + density, data = wine)
##
| | m1 | m2 | m3 | m4 | m5 | m6 | m7 |
|---|---|---|---|---|---|---|---|
| (Intercept) | 1.875*** (0.175) | 3.095*** (0.184) | 3.369*** (0.184) | 3.069*** (0.199) | 4.225*** (0.360) | 4.842*** (0.449) | -6.479 (11.909) |
| I(alcohol) | 0.361*** (0.017) | 0.314*** (0.016) | 0.303*** (0.016) | 0.282*** (0.017) | 0.295*** (0.017) | 0.302*** (0.017) | 0.312*** (0.020) |
| volatile.acidity | | -1.384*** (0.095) | -1.156*** (0.097) | -1.100*** (0.098) | -0.987*** (0.102) | -1.110*** (0.115) | -1.137*** (0.118) |
| log10(sulphates) | | | 1.477*** (0.177) | 1.713*** (0.187) | 1.690*** (0.186) | 1.742*** (0.187) | 1.715*** (0.189) |
| log10(chlorides) | | | | -0.491*** (0.128) | -0.612*** (0.131) | -0.564*** (0.132) | -0.573*** (0.133) |
| pH | | | | | -0.448*** (0.116) | -0.595*** (0.133) | -0.598*** (0.133) |
| citric.acid | | | | | | -0.276* (0.121) | -0.331* (0.134) |
| density | | | | | | | 11.274 (11.852) |
| R-squared | 0.227 | 0.317 | 0.345 | 0.352 | 0.357 | 0.360 | 0.360 |
| adj. R-squared | 0.226 | 0.316 | 0.344 | 0.350 | 0.355 | 0.357 | 0.357 |
| sigma | 0.710 | 0.668 | 0.654 | 0.651 | 0.648 | 0.647 | 0.647 |
| F | 468.267 | 370.379 | 280.646 | 216.010 | 177.270 | 148.984 | 127.822 |
| p | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Log-likelihood | -1721.057 | -1621.814 | -1587.752 | -1580.357 | -1572.954 | -1570.342 | -1569.887 |
| Deviance | 805.870 | 711.796 | 682.108 | 675.828 | 669.599 | 667.415 | 667.035 |
| AIC | 3448.114 | 3251.628 | 3185.503 | 3172.714 | 3159.908 | 3156.683 | 3157.774 |
| BIC | 3464.245 | 3273.136 | 3212.389 | 3204.977 | 3197.548 | 3199.700 | 3206.168 |
| N | 1599 | 1599 | 1599 | 1599 | 1599 | 1599 | 1599 |
The R-squared value for the alcohol-only model (m1) was about 23%, which is very low. R-squared increased with each additional input variable, but by less each time. Alcohol and volatile acidity together (m2) explain about 32% of the variance in quality, and the full model (m7) explains about 36%. Sulphates, chlorides and pH contributed to varying degrees. Citric acid and density do not seem to contribute very much: adding them barely changes R-squared, and the density coefficient is not statistically significant.
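The AIC and BIC rows of the model table offer another way to compare the seven models. The sketch below recomputes both criteria from the reported log-likelihoods, assuming each model's parameter count is its number of coefficients (including the intercept) plus one for the residual variance. The vector names are hypothetical; only the numbers come from the table.

```r
n <- 1599  # number of wines

# Log-likelihoods reported in the model table.
log_lik <- c(m1 = -1721.057, m2 = -1621.814, m3 = -1587.752, m4 = -1580.357,
             m5 = -1572.954, m6 = -1570.342, m7 = -1569.887)

# Assumed parameter counts: intercept + predictors + residual variance.
k <- 3:9

# Standard definitions of the two information criteria.
aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k

print(round(rbind(aic, bic), 3))

# The model with the smallest value is preferred under each criterion.
best_aic <- names(which.min(aic))
best_bic <- names(which.min(bic))
print(c(AIC = best_aic, BIC = best_bic))
```

Under these assumptions the recomputed values agree with the table to within rounding. AIC is lowest for m6 and BIC is lowest for m5, so neither criterion favors adding density, which fits the observation that it contributes very little.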
Excellent wines had a rating of 7 or 8. From the multivariate plots, citric acid, sulphates and alcohol tend to increase with the quality of wine. Citric acid and sulphates reinforced each other when looking at the wine quality, because both appear positively associated with it. In addition, wines high in both citric acid and alcohol tended to be rated excellent. Citric acid and alcohol have a positive relationship. Citric acid has a significant influence on fixed acidity as well.
I thought the most interesting interactions involved volatile acidity and density. I never really thought that these would play a crucial role in the wine quality. However, from the plots above, volatile acidity and density do affect the quality of wine. One other surprising feature is sulphates. When I started the univariate analysis, I did not expect sulphates to affect quality. However, the multivariate analysis proved me wrong.
Yes, I did create a linear regression model. Given the dataset, it appears that it is rather difficult to predict the quality of wine. The R squared value is low. The most important predictor is alcohol. The model is still useful because it shows the importance of each input variables. The one limitation I see is the lack of variation in the dataset.
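To make the model concrete, the sketch below plugs values into the two-predictor model m2 using the coefficients from the table (intercept 3.095, alcohol 0.314, volatile acidity -1.384). The function name `predict_quality_m2` is hypothetical, and the inputs are the first wine in the structure output and a wine with the median alcohol and volatile acidity.

```r
# Prediction from model m2: quality ~ alcohol + volatile.acidity,
# using the rounded coefficients reported in the model table.
predict_quality_m2 <- function(alcohol, volatile_acidity) {
  3.095 + 0.314 * alcohol - 1.384 * volatile_acidity
}

# First wine in the dataset: alcohol 9.4%, volatile acidity 0.70 g/dm^3.
first_wine <- predict_quality_m2(alcohol = 9.4, volatile_acidity = 0.70)

# A wine at the median of both predictors: 10.2% and 0.52 g/dm^3.
median_wine <- predict_quality_m2(alcohol = 10.2, volatile_acidity = 0.52)

print(round(c(first_wine = first_wine, median_wine = median_wine), 2))
```

The first wine's prediction of about 5.08 is close to its actual rating of 5, and the median wine's prediction of about 5.58 sits between the two most common scores. Predictions from this model stay near the middle of the scale, which reflects how few wines in the data have extreme ratings.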
In this analysis, I tried to understand how quality of the red wine is determined by the input variables. I created many plots to see if I could detect any of the features affected the red wine quality. I’ll share three plots that perhaps stood out to me. These plots are derived from the analysis done above.
I started my investigation of the red wine data with quality. I wanted to see the range of quality present in this dataset. The quality scale runs from 0 (poor) to 10 (excellent), although the observed scores fall between 3 and 8. The plot provided me with key information on where most wines fell in the quality range.
In this plot, I wanted to see the different levels of red wine quality. I was surprised to see that a large majority of them fell in either 5 or 6. As a result, I grouped them into three different ratings, namely “Poor”, “Good” and “Excellent”. It was clear most wines were “Good”, followed by the “Excellent” wines. There were very few “Poor” wines. Also, the plot shows no wines with a quality score of 0, 1, 2, 9 or 10. This led me to ask which variables affected the different red wine quality levels.
In the bivariate plots, alcohol had the strongest correlation with red wine quality compared to the other input variables. It was clear that alcohol plays a crucial role in wine quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
Alcohol was the variable most strongly correlated with wine quality. As the alcohol content increases, the quality of the wine tends to increase as well. Some key features of this plot are that the median (12.15) of quality 8 is greater than the upper quartile (12.10) of quality 7, and the lower quartile (11.32) of quality 8 is greater than the upper quartile (11.30) of quality 6. These features show a tendency for high quality wines to have high alcohol content. The correlation between alcohol and wine quality (0.48) is moderate rather than strong, but the positive trend is clear.
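As a quick check on the correlation test output above, the t statistic and the confidence interval can be recomputed from just the reported correlation and sample size. This is a minimal sketch using the standard formulas for a Pearson correlation (the t statistic, and a confidence interval via Fisher's z transformation); the function names are hypothetical.

```r
# Hypothetical helper: t statistic for testing that a Pearson correlation is zero.
cor_t_stat <- function(r, n) {
  r * sqrt(n - 2) / sqrt(1 - r^2)
}

# Hypothetical helper: confidence interval via Fisher's z transformation.
cor_conf_int <- function(r, n, level = 0.95) {
  z    <- atanh(r)                                    # transform r to the z scale
  half <- qnorm(1 - (1 - level) / 2) / sqrt(n - 3)    # half-width on the z scale
  tanh(c(z - half, z + half))                         # back-transform to the r scale
}

r <- 0.4761663  # correlation reported above
n <- 1599       # number of wines (df = 1597 means n - 2 = 1597)

cor_t_stat(r, n)    # about 21.64, matching t = 21.639
cor_conf_int(r, n)  # about 0.437 and 0.513, matching the reported interval
r^2                 # about 0.227
```

The last line is the useful one for interpretation: a correlation of 0.48 means alcohol on its own accounts for only about 23% of the variance in quality, which is why the relationship is better described as moderate.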
In the final plot, I wanted to show how the two variables with the strongest correlations with quality, alcohol and volatile acidity, relate to red wine quality.
Excellent quality red wines seem to have low volatile acidity and high alcohol content. This makes sense. Too much volatile acid, such as acetic acid, makes a wine taste like vinegar or furniture polish. As a result, the wine becomes undrinkable. The lower the volatile acidity, the better the wine quality tends to be.
Alcohol is the product of the fermentation of natural grape sugars by yeast, and without it wine simply doesn’t exist. Grapes mostly contain water and sugar. During fermentation, the yeast consumes the sugar and converts it into alcohol. As a result, grape juice turns into wine. In this dataset, high quality red wines generally have high alcohol content. Many red wines are high in alcohol. For example, Zinfandel and Shiraz are typically high alcohol wines, as is Madeira, which is a fortified wine.
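The pattern in the final plot can also be checked numerically by comparing the median alcohol content and volatile acidity across the three ratings. This is a minimal sketch: the function name `median_by_rating` is hypothetical, and the rating cut points (Poor = 4 or below, Good = 5 to 6, Excellent = 7 or above) are an assumption inferred from the text.

```r
# Hypothetical helper: median alcohol and volatile acidity per rating group.
# ASSUMPTION: rating cut points are Poor <= 4, Good 5-6, Excellent >= 7.
median_by_rating <- function(wine) {
  wine$rating <- cut(
    wine$quality,
    breaks = c(-Inf, 4, 6, Inf),
    labels = c("Poor", "Good", "Excellent")
  )
  # One row per rating, one column per summarised variable.
  aggregate(cbind(alcohol, volatile.acidity) ~ rating, data = wine, FUN = median)
}

# Usage, assuming `wine` is the red wine data frame:
# median_by_rating(wine)
```

If the plot's story holds, the alcohol column should rise and the volatile acidity column should fall when moving from “Poor” to “Excellent”.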
In this analysis, there were some difficulties. Most of the correlation coefficients showed either weak or negligible relationships. This was an indication to me that perhaps the dataset is too small or that some variables are missing. Based on my wine research, there are many other variables that contribute to wine quality. These include, but are not limited to, temperature, tannin levels, speed, oxygen levels, glycerol levels, grape quality and climate. Even though alcohol had the largest impact on wine quality, the statistics reveal that it had only a moderate positive correlation. Hence, it felt like this dataset did not have the variables needed to explain red wine quality.
The next struggle I had was with the quality variable itself. I had categorized wine quality into “Poor”, “Good” and “Excellent”. Most wines fell into the “Good” rating: 1,319 of the 1,599 wines in the dataset, or about 82%. So the “Good” group was far larger than the other two. In addition, there was not a single red wine with a quality score of 0, 1, 2, 9 or 10. It seemed odd to me that this was the case.
The next hardship was that I was not a wine drinker, nor was I familiar with wine tasting or the wine making process. So I decided to read many articles, papers and blog posts about wine making. I must say I learned a lot about red wine composition, fermentation, wine traits and vinification, which helped me tremendously in analyzing this data set. This learning process took me a while before I actually started my exploratory data analysis project.
Another difficulty I had was using R for exploratory data analysis. I had never used R for data analysis before. Hence, I went through each video and lesson in Udacity’s Explore and Summarize Data course to teach myself statistical programming in R. I also used the links in the instructor’s notes to learn more about R. I learned about the different R packages, R Markdown files, R scripts, how to analyze one, two and multiple variables, transforming data, ggplot, scatter plots, correlations and even linear regression models. This was a steep learning curve, but I truly enjoyed R.
When I started this project, I just wanted to focus on a few variables for the univariate plots. But then my curiosity led me to investigate all the other variables in bivariate and multivariate plots. After plotting the bivariate and multivariate plots, I decided to go back to the univariate plots and plot the rest of the input variables. I wanted to really see what actually affected the wine quality. The discovery was amazing. I discovered that the factors that affected the quality of the wine the most were alcohol and volatile acids.
First, I noticed that some wines didn’t have citric acid at all. The most common value was 0.00 (132 wines). I thought that something was definitely not right with the dataset. I decided to do some research on wine and its relationship with citric acid. From my research, citric acid is actually added to some wines to increase the acidity. This is because citric acid adds ‘freshness’ and flavor to wines. So it made sense to me that some wines would not have any citric acid at all, because none was added.
In my analysis, volatile acids had an inverse relationship with wine quality. This was unexpected. I thought all acids played an important role in the taste of wine and hence increased the wine quality. But this is not necessarily the case with volatile acids. Acetic acid is the most common volatile acid. Acetic acid is what gives wine a sour vinegar taste. From my wine research, a large quantity of acetic acid bacteria means the wine is considered spoiled. Keeping volatile acids to a minimum is important in the wine making process. The lower the volatile acidity, the better the quality of the wine. One of the ways to reduce volatile acids is by adding sulfur dioxide to keep harmful bacteria in check. This brings us to the next variable, sulphates.
I was surprised that sulphates had an impact on wine quality. I had not anticipated this when I was plotting the graphs. Sulphates are a wine additive that can contribute to sulfur dioxide gas levels. Sulfur dioxide acts as an antimicrobial that prevents microbial growth in wine and as an antioxidant that prevents oxidation of wine. Most wineries are likely to add sulfur to the macerated grapes and/or must. It protects the must from bacteria and mold that might have been transmitted to the grape clusters either in the vineyard or on the way to the winery. Sulfur is not added during fermentation. When the wine has fermented as much as it will, sulfur is then added to protect the wine through aging. The wine’s pH and alcohol level contribute to how much sulfur is added prior to bottling. This was an interesting revelation for me.
In the bivariate and multivariate analysis, alcohol content was clearly the variable most associated with the quality of the red wine. The sweet, sugary juice of the crushed grapes (the must) is transformed into alcohol through fermentation. This is key in wine. Without it there is simply no wine. In this dataset, higher quality wines tended to have greater alcohol content. This is evident in the bivariate and multivariate analysis, which showed that “Excellent” wines had high alcohol content.
For future analysis, I would love to have a dataset with more input variables that reflect the quality of wine. For example, variables such as temperature, tannin levels, speed, oxygen levels, glycerol levels, grape quality, climate and price would perhaps add more depth to the analysis. Another possible next step is to apply machine learning to provide more accurate predictions of red wine quality.
In conclusion, this course and the red wine analysis were a positive experience. I truly enjoyed the challenges they offered and solving them. R is a great tool for visualization and data exploration. I feel more confident in R than I ever was before.